Summary

This analysis walks through a series of visualizations of the insurance dataset and what each one shows. The bar plot of region shows uneven counts: the Southeast region has the highest count. The BMI histogram is roughly symmetric, while the charges histogram is skewed right. The box plot of BMI by region highlights regional differences in the BMI distribution: the Southeast region has a large IQR and contains outliers. The scatterplot shows a positive relationship between age and charges, with color indicating whether or not each person is a smoker. The pie chart shows that cases with no children or one child are the most common, while cases with more children are less common. The box plot of charges by number of children shows numerous outliers, most often for zero, one, two, and three children, which may reflect the wide variation in charges. Overall, these visualizations give insight into the differences among regions and the relationships among variables, and they provide a starting point for further analysis.

Question One

column

#1. Read the data file insurance.csv using the read_csv() function in tidyverse.

Column

````{r}
library(readr)
insurance_data <- read_csv("insurance.data.csv")
head(insurance_data)
````

Question Two

column

#2. Get a glimpse of the data and indicate the number of observations and the number of variables in the data.

Column

# A tibble: 6 × 7
    age sex      bmi children smoker region    charges
  <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
1    19 female  27.9        0 yes    southwest  16885.
2    18 male    33.8        1 no     southeast   1726.
3    28 male    33          3 no     southeast   4449.
4    33 male    22.7        0 no     northwest  21984.
5    32 male    28.9        0 no     northwest   3867.
6    31 female  25.7        0 no     southeast   3757.
[1] 1338    7

Question Three

column

#3. Create a bar plot of region. Use a few sentences to summarize your finding based on the plot.

In the bar plot of region, southeast is the largest region with a count over 350, while the other three regions each have a count a little over 300. The mean count per region appears to fall between 325 and 350.

Column

Question Four

column

#4. Create a stack bar plot such that region is on the x axis and each bar shows the distribution of smoker in that region. You should make sure that your y axis shows percents.

Column

Question Five

column

#5. Create a histogram of bmi. Discuss the distribution of the histogram.

The BMI histogram shows a roughly symmetric, unimodal distribution: the values are spread evenly on both sides of the center, with no clear left or right skew.

Column

Question Six

column

#6. Create a histogram of charges. Discuss the distribution of the histogram.

The histogram of charges is skewed right: the maximum frequency, around 125, occurs where charges are near zero, and the tail extends toward the higher charges.

Column

Question Seven

column

#7. Create a boxplot that shows the distribution of bmi based on the region. Discuss what you find based on the boxplot. (Hint: you need to have x and y variables in mapping)

In the boxplot, the center line of each box, which marks the median rather than the mean, sits at around 30 for BMI. Once again the southeast stands out, this time with the largest IQR of the regions. Unlike the earlier visuals, a boxplot also shows outliers in the data, which appear as the dots placed above certain boxes.

Column

Question Eight

column

#8. Create a scatterplot that shows the relationship between age (independent variable) and charges (dependent variable). Comment on the scatterplot.

The scatterplot suggests a positive relationship between age and charges: the points drift upward as they move to the right on the chart. The shift is not large, so the relationship is probably moderate or weak.

Column

Question Nine

column

#9. You should find that it seems “charges” could be classified into several groups. Let’s create a scatterplot that has age as the independent variable (x) and has smoker as another categorical variable (color), and the response variable is charges. Comment on the scatterplot.

Compared to the previous scatterplot, the color coding makes the data easier to read. The plot still shows a positive relationship, and it now separates that relationship by smoker status. With the colors, a line of dots becomes visible between smokers and nonsmokers near the 2000 mark on the charges axis.

Column

Question Ten

column

#10. Now, create two data frames by subsetting insurance data as follows:

smoker <- insurance[insurance$smoker=="yes"]

nonsmoker <- insurance[insurance$smoker=="no"]

Column

# A tibble: 274 × 7
     age sex      bmi children smoker region    charges
   <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
 1    19 female  27.9        0 yes    southwest  16885.
 2    62 female  26.3        0 yes    southeast  27809.
 3    27 male    42.1        0 yes    southeast  39612.
 4    30 male    35.3        0 yes    southwest  36837.
 5    34 female  31.9        1 yes    northeast  37702.
 6    31 male    36.3        2 yes    southwest  38711 
 7    22 male    35.6        0 yes    southwest  35586.
 8    28 male    36.4        1 yes    southwest  51195.
 9    35 male    36.7        1 yes    northeast  39774.
10    60 male    39.9        0 yes    southwest  48173.
# ℹ 264 more rows
# A tibble: 1,064 × 7
     age sex      bmi children smoker region    charges
   <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
 1    18 male    33.8        1 no     southeast   1726.
 2    28 male    33          3 no     southeast   4449.
 3    33 male    22.7        0 no     northwest  21984.
 4    32 male    28.9        0 no     northwest   3867.
 5    31 female  25.7        0 no     southeast   3757.
 6    46 female  33.4        1 no     southeast   8241.
 7    37 female  27.7        3 no     northwest   7282.
 8    37 male    29.8        2 no     northeast   6406.
 9    60 female  25.8        0 no     northwest  28923.
10    25 male    26.2        0 no     northeast   2721.
# ℹ 1,054 more rows

Question Eleven

column

#11. Create a scatterplot that has age as the independent variable (x) and the response variable is charges using the data frame smoker. Then add the smooth line. Comment on the plot. Does it make sense to use the smooth line to summarize the relationship between age of clients and the corresponding charges? Why?

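Not really. Age and charges are both numeric, so a smooth line can be fitted, but the charges appear to fall into separate groups (as Question 9 suggests), and a single line cannot describe several groups at once. A smooth line makes more sense when the points form one band.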

Column

Question Twelve

column

#12. Repeat Question 11 using the data frame nonsmoker.

Column

Question Thirteen

column

#13. Based on the finding you have on Questions 11 & 12, propose what you might do next if you want to model charges using other variables in this data.

One area of interest would be the variable BMI. Adding BMI to the analysis could show whether it is another variable that influences charges. Its relationship with charges could turn out to be positive, negative, or absent.

Column

Question Fourteen

column

#14. Create a pie chart of children. Use a few sentences to summarize your finding based on the plot. (Hint: You need to convert the variable to a categorical variable first)

In the pie chart, the two largest slices are zero children and one child, which together take up about 60 to 70 percent of the chart. The remaining categories (2, 3, and 4 or more children) are smaller, taking up about 30 to 40 percent of the chart.

Column

Question Fifteen

column

#15. Create a boxplot that shows the distribution of charges based on the number of children. Discuss what you find based on the boxplot.

The first thing that stands out in the boxplot is the number of outliers. This is clearest for zero children, where there are numerous outliers, shown as dots on the boxplot. The groups with one, two, and three children also have a notable number of outliers. The groups with two and three children look similar: both have outliers, and their IQRs and ranges are alike. Looking across the boxes, most groups appear skewed right, since the median sits near the bottom of the box.

Column

---
title: "Assignment 7"
author: "Luke Keirn"
date: "2024-03-14"
output: 
  flexdashboard::flex_dashboard:
    source_code: embed
    orientation: columns
    vertical_layout: fill
---

```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(DT)
library(plotly)
insurance_data <- read_csv("insurance.data.csv")
```


Summary
===

This analysis walks through a series of visualizations of the insurance dataset and what each one shows. The bar plot of region shows uneven counts: the Southeast region has the highest count. The BMI histogram is roughly symmetric, while the charges histogram is skewed right. The box plot of BMI by region highlights regional differences in the BMI distribution: the Southeast region has a large IQR and contains outliers. The scatterplot shows a positive relationship between age and charges, with color indicating whether or not each person is a smoker. The pie chart shows that cases with no children or one child are the most common, while cases with more children are less common. The box plot of charges by number of children shows numerous outliers, most often for zero, one, two, and three children, which may reflect the wide variation in charges. Overall, these visualizations give insight into the differences among regions and the relationships among variables, and they provide a starting point for further analysis.


Question One
===

column {data-width=450}
---

#1. Read the data file `insurance.csv` using the `read_csv()` function in `tidyverse`.

Column {.tabset data-width=550}
---

````{r}
library(readr)
insurance_data <- read_csv("insurance.data.csv")
head(insurance_data)
````

Question Two
===

column {data-width=450}
---

#2. Get a glimpse of the data and indicate the number of observations and the number of variables in the data.

Column {.tabset data-width=550}
---

````{r glimpse_num_obs_num_var}
head(insurance_data)
dim(insurance_data)
````
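
The `dim()` output of 1338 and 7 means the data has 1338 observations (rows) and 7 variables (columns). As a minimal sketch of reading those two numbers off any data frame, the helper below labels them explicitly. The name `describe_shape()` is hypothetical, and `toy` is made-up illustration data, not the insurance file.

```{r}
# Hypothetical helper: label the two numbers that dim() returns
describe_shape <- function(df) {
  c(observations = nrow(df), variables = ncol(df))
}

# Made-up toy data for illustration only (not the insurance file)
toy <- data.frame(age = c(19, 18, 28), smoker = c("yes", "no", "no"))
shape <- describe_shape(toy)
shape
```

In this dashboard the call would be `describe_shape(insurance_data)`. `dplyr::glimpse(insurance_data)` reports the same row and column counts along with each column's type.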

Question Three
===

column {data-width=450}
---

#3. Create a bar plot of region. Use a few sentences to summarize your finding based on the plot.

In the bar plot of region, southeast is the largest region with a count over 350, while the other three regions each have a count a little over 300. The mean count per region appears to fall between 325 and 350.

Column {.tabset data-width=550}
---

````{r}
library(ggplot2)
ggplot(data = insurance_data, aes(x = region)) +
  geom_bar() +
  labs(title = "Distribution of Regions",
       x = "Region",
       y = "Count")
````
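
The mean count per region can be checked directly rather than read off the plot: with 1338 rows and four regions it is 1338 / 4 = 334.5. A minimal sketch of that calculation follows, using made-up region labels (`toy_region` is an assumption for illustration, not the insurance data).

```{r}
# Made-up region labels for illustration only
toy_region <- c("southeast", "southeast", "southeast",
                "northwest", "northeast", "southwest")

# table() produces the same tallies that geom_bar() draws as bar heights
region_counts <- table(toy_region)
region_counts

# Mean bar height: total rows divided by the number of regions
mean_count <- mean(as.vector(region_counts))
mean_count
```

For the dashboard data the equivalent call would be `table(insurance_data$region)`.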

Question Four
===

column {data-width=450}
---

#4. Create a stack bar plot such that region is on the x axis and each bar shows the distribution of smoker in that region. You should make sure that your y axis shows percents.

Column {.tabset data-width=550}
---

````{r}
library(ggplot2)
library(dplyr)
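smoker_percent <- insurance_data %>%
  group_by(region, smoker) %>%
  # drop_last keeps the data grouped by region, so sum(count) is the region total
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(percent = count / sum(count) * 100)
ggplot(smoker_percent, aes(x = region, y = percent, fill = smoker)) +
  geom_bar(stat = "identity") +
  labs(title = "Distribution of Smokers by Region",
       x = "Region",
       y = "Percentage") +
  # percent is already on a 0-100 scale, so the labels must not multiply by 100 again
  scale_y_continuous(labels = scales::label_percent(scale = 1)) +
  theme_minimal()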
````
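
The within-region percentages behind a stacked bar plot like this can also be tabulated without plotting. The sketch below uses base R's `prop.table()` on made-up data (`toy` is an assumption for illustration, not the insurance file).

```{r}
# Made-up toy data for illustration only
toy <- data.frame(
  region = c("southeast", "southeast", "southeast", "southeast",
             "northwest", "northwest"),
  smoker = c("yes", "no", "no", "no", "yes", "no")
)

counts <- table(toy$region, toy$smoker)

# margin = 1 divides each row (region) by its own total, so rows sum to 100
shares <- prop.table(counts, margin = 1) * 100
shares
```

Each row of `shares` corresponds to one bar, and the columns are the segment heights in percent.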

Question Five
===

column {data-width=450}
---

#5. Create a histogram of bmi. Discuss the distribution of the histogram.

The BMI histogram shows a roughly symmetric, unimodal distribution: the values are spread evenly on both sides of the center, with no clear left or right skew.

Column {.tabset data-width=550}
---

````{r}
library(ggplot2)
ggplot(insurance_data, aes(x = bmi)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "black") +
  labs(title = "Distribution of BMI",
       x = "BMI",
       y = "Frequency") +
  theme_minimal()
````

Question Six
===

column {data-width=450}
---

#6. Create a histogram of charges. Discuss the distribution of the histogram.

The histogram of charges is skewed right: the maximum frequency, around 125, occurs where charges are near zero, and the tail extends toward the higher charges.

Column {.tabset data-width=550}
---

````{r}
library(ggplot2)
ggplot(insurance_data, aes(x = charges)) +
  geom_histogram(binwidth = 1000, fill = "blue", color = "black") +
  labs(title = "Distribution of Charges",
       x = "Charges",
       y = "Frequency") +
  theme_minimal()
````
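
The direction of skew is named after the side with the long tail, not the side with the peak. A rough rule of thumb (not a formal test) is to compare the mean with the median: a long right tail pulls the mean above the median. The sketch below applies that rule to made-up values. The helper name `skew_direction()` and the vector `toy_charges` are assumptions for illustration.

```{r}
# Hypothetical helper: compare mean and median as a rough skew check
skew_direction <- function(x) {
  if (mean(x) > median(x)) {
    "right"      # long tail toward large values
  } else if (mean(x) < median(x)) {
    "left"       # long tail toward small values
  } else {
    "symmetric"
  }
}

# Made-up values: most are small, one is very large
toy_charges <- c(1, 2, 2, 3, 3, 4, 30)
direction <- skew_direction(toy_charges)
direction
```

In this dashboard the same check could be run as `skew_direction(insurance_data$charges)` or `skew_direction(insurance_data$bmi)`.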

Question Seven
===

column {data-width=450}
---

#7. Create a boxplot that shows the distribution of bmi based on the region. Discuss what you find based on the boxplot. (Hint: you need to have x and y variables in mapping)

In the boxplot, the center line of each box, which marks the median rather than the mean, sits at around 30 for BMI. Once again the southeast stands out, this time with the largest IQR of the regions. Unlike the earlier visuals, a boxplot also shows outliers in the data, which appear as the dots placed above certain boxes.

Column {.tabset data-width=550}
---

````{r}
library(ggplot2)
ggplot(insurance_data, aes(x = region, y = bmi)) +
  geom_boxplot(fill = "blue", color = "black") +
  labs(title = "Distribution of BMI Based on Region",
       x = "Region",
       y = "BMI") +
  theme_minimal()
````
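
The numbers a boxplot summarizes can be computed per group to back up a visual reading. This is a minimal sketch with made-up BMI values (`toy` is an assumption for illustration): it returns the median and the IQR for each region.

```{r}
# Made-up toy data for illustration only
toy <- data.frame(
  region = rep(c("southeast", "northwest"), each = 4),
  bmi    = c(24, 30, 36, 42, 27, 29, 31, 33)
)

# tapply() applies a function to bmi separately within each region
medians <- tapply(toy$bmi, toy$region, median)  # the line inside each box
iqrs    <- tapply(toy$bmi, toy$region, IQR)     # the height of each box
medians
iqrs
```

With the dashboard data, the calls would use `insurance_data$bmi` and `insurance_data$region` in place of the toy columns.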

Question Eight
===

column {data-width=450}
---

#8. Create a scatterplot that shows the relationship between age (independent variable) and charges (dependent variable). Comment on the scatterplot.

The scatterplot suggests a positive relationship between age and charges: the points drift upward as they move to the right on the chart. The shift is not large, so the relationship is probably moderate or weak.

Column {.tabset data-width=550}
---

````{r}
library(ggplot2)
age_charges_scatter <- ggplot(insurance_data, aes(x = age, y = charges)) +
  geom_point(color = "blue") +
  labs(title = "Relationship Between Age and Charges",
       x = "Age",
       y = "Charges")
print(age_charges_scatter)
````
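
The strength of a relationship seen in a scatterplot can be given a number with the Pearson correlation, which ranges from -1 to 1 and measures only linear association. The sketch below computes it on made-up values (`toy` is an assumption for illustration, not the insurance data).

```{r}
# Made-up toy data for illustration only
toy <- data.frame(
  age     = c(20, 30, 40, 50, 60),
  charges = c(3000, 2500, 6000, 5000, 9000)
)

# cor() defaults to the Pearson correlation coefficient
r <- cor(toy$age, toy$charges)
r
```

For the dashboard data the call would be `cor(insurance_data$age, insurance_data$charges)`. A value close to 0 would point to a weak relationship, and a value close to 1 to a strong positive one.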

Question Nine
===

column {data-width=450}
---

#9. You should find that it seems "charges" could be classified into several groups. Let's create a scatterplot that has age as the independent variable (x) and has smoker as another categorical variable (color), and the response variable is charges. Comment on the scatterplot.

Compared to the previous scatterplot, the color coding makes the data easier to read. The plot still shows a positive relationship, and it now separates that relationship by smoker status. With the colors, a line of dots becomes visible between smokers and nonsmokers near the 2000 mark on the charges axis.

Column {.tabset data-width=550}
---

````{r}
library(ggplot2)
age_charges_smoker_scatter <- ggplot(insurance_data, aes(x = age, y = charges, color = smoker)) +
  geom_point(alpha = 0.6) +
  labs(title = "Relationship Between Age, Charges, and Smoker Status",
       x = "Age",
       y = "Charges",
       color = "Smoker")
print(age_charges_smoker_scatter)
````

Question Ten
===

column {data-width=450}
---

#10. Now, create two data frames by subsetting insurance data as follows:

  smoker <- insurance[insurance$smoker=="yes"]
  
  nonsmoker <- insurance[insurance$smoker=="no"]
  
Column {.tabset data-width=550}
---

````{r}
smoker <- insurance_data[insurance_data$smoker == "yes", ]
print(smoker)
nonsmoker <- insurance_data[insurance_data$smoker == "no", ]
print(nonsmoker)
````
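
The comma inside the square brackets matters when subsetting rows. With `df[rows, ]` the logical vector picks rows; without the comma, a data frame treats the vector as a column selector instead. A minimal sketch on made-up data (`toy` is an assumption for illustration):

```{r}
# Made-up toy data for illustration only
toy <- data.frame(
  age     = c(25, 40, 33, 51),
  smoker  = c("yes", "no", "yes", "no"),
  charges = c(15000, 4000, 21000, 9000)
)

# Trailing comma: the logical vector selects rows, and all columns are kept
smoker_rows <- toy[toy$smoker == "yes", ]

# subset() expresses the same row filter without repeating the data frame name
smoker_rows_alt <- subset(toy, smoker == "yes")

smoker_rows
```

Both forms return the two rows where `smoker` is "yes".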

Question Eleven
===

column {data-width=450}
---

#11. Create a scatterplot that has age as the independent variable (x) and the response variable is charges using the data frame smoker. Then add the smooth line. Comment on the plot. Does it make sense to use the smooth line to summarize the relationship between age of clients and the corresponding charges? Why?

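Not really. Age and charges are both numeric, so a smooth line can be fitted, but the charges appear to fall into separate groups (as Question 9 suggests), and a single line cannot describe several groups at once. A smooth line makes more sense when the points form one band.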

Column {.tabset data-width=550}
---

````{r}
smoker_age_charges_plot <- ggplot(smoker, aes(x = age, y = charges)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Relationship Between Age and Charges for Smokers",
       x = "Age",
       y = "Charges") +
  theme_minimal()
print(smoker_age_charges_plot)
````
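
One way to check whether a single fitted line summarizes a scatterplot well is to look at its residuals. The sketch below uses made-up data with two parallel bands of charges (an assumption for illustration, echoing the groups hinted at in Question 9). The fitted line runs between the two bands, so every point misses it by the same large amount.

```{r}
# Made-up toy data: two parallel bands of charges, 20000 apart
age  <- rep(c(20, 30, 40, 50), times = 2)
band <- rep(c(0, 20000), each = 4)
toy  <- data.frame(age = age, charges = 1000 + 100 * age + band)

# One straight line fitted through both bands at once
fit <- lm(charges ~ age, data = toy)

# Residuals split into two clusters instead of scattering around zero
band_residuals <- round(residuals(fit))
band_residuals
```

Residuals that cluster away from zero like this suggest the line describes neither group, and that a grouping variable is missing from the picture.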

Question Twelve
===

column {data-width=450}
---

#12. Repeat Question 11 using the data frame nonsmoker.

Column {.tabset data-width=550}
---

````{r}
nonsmoker_age_charges_plot <- ggplot(nonsmoker, aes(x = age, y = charges)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Relationship Between Age and Charges for Non-Smokers",
       x = "Age",
       y = "Charges") +
  theme_minimal()
print(nonsmoker_age_charges_plot)
````

Question Thirteen
===

column {data-width=450}
---

#13. Based on the finding you have on Questions 11 & 12, propose what you might do next if you want to model charges using other variables in this data.

One area of interest would be the variable BMI. Adding BMI to the analysis could show whether it is another variable that influences charges. Its relationship with charges could turn out to be positive, negative, or absent.

Column {.tabset data-width=550}
---
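
As a rough sketch of one possible next step, a linear model can include several predictors at once. The formula `charges ~ age + bmi + smoker` and the helper name `fit_charges_model()` are assumptions for illustration, and `toy` is made-up data built from known coefficients so the output can be checked by eye.

```{r}
# Hypothetical helper: model charges from age, BMI, and smoker status
fit_charges_model <- function(df) {
  lm(charges ~ age + bmi + smoker, data = df)
}

# Made-up toy data with known effects: 200 per year of age,
# 300 per BMI unit, and 20000 extra for smokers
toy <- data.frame(
  age    = c(20, 30, 40, 50, 25, 35, 45, 55),
  bmi    = c(22, 31, 27, 35, 29, 24, 33, 26),
  smoker = rep(c("no", "yes"), each = 4)
)
toy$charges <- 1000 + 200 * toy$age + 300 * toy$bmi +
  20000 * (toy$smoker == "yes")

fit <- fit_charges_model(toy)
round(coef(fit))
```

With the dashboard data the call would be `fit_charges_model(insurance_data)`. The sign of the `bmi` coefficient would then indicate whether the relationship with charges is positive or negative.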

Question Fourteen
===

column {data-width=450}
---

#14. Create a pie chart of children. Use a few sentences to summarize your finding based on the plot. (Hint: You need to convert the variable to a categorical variable first)

In the pie chart, the two largest slices are zero children and one child, which together take up about 60 to 70 percent of the chart. The remaining categories (2, 3, and 4 or more children) are smaller, taking up about 30 to 40 percent of the chart.

Column {.tabset data-width=550}
---

````{r}
insurance_data <- insurance_data %>%
  mutate(children_category = case_when(
    children == 0 ~ "0 children",
    children == 1 ~ "1 child",
    children == 2 ~ "2 children",
    children == 3 ~ "3 children",
    children >= 4 ~ "4 or more children"
  ))
children_pie_chart <- ggplot(insurance_data, aes(x = "", fill = children_category)) +
  geom_bar(width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Distribution of Number of Children",
       fill = "Number of Children") +
  theme_void() +
  theme(legend.position = "right")
print(children_pie_chart)
````
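
Slice sizes are hard to judge by eye, so the shares can also be computed directly. This is a minimal sketch on made-up counts of children (`toy_children` is an assumption for illustration, not the insurance data).

```{r}
# Made-up toy data for illustration only
toy_children <- c(0, 0, 0, 0, 1, 1, 2, 3, 4, 5)

# Collapse 4 and above into one category, mirroring the pie chart labels
category <- ifelse(toy_children >= 4, "4 or more", as.character(toy_children))

# Share of each category in percent; the shares sum to 100
shares <- prop.table(table(category)) * 100
shares
```

Applying the same steps to `insurance_data$children` would give the exact percentage behind each slice.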

Question Fifteen
===

column {data-width=450}
---

#15. Create a boxplot that shows the distribution of charges based on the number of children. Discuss what you find based on the boxplot.

The first thing that stands out in the boxplot is the number of outliers. This is clearest for zero children, where there are numerous outliers, shown as dots on the boxplot. The groups with one, two, and three children also have a notable number of outliers. The groups with two and three children look similar: both have outliers, and their IQRs and ranges are alike. Looking across the boxes, most groups appear skewed right, since the median sits near the bottom of the box.

Column {.tabset data-width=550}
---

````{r}
charges_children_boxplot <- ggplot(insurance_data, aes(x = factor(children), y = charges)) +
  geom_boxplot() +
  labs(title = "Distribution of Charges Based on Number of Children",
       x = "Number of Children",
       y = "Charges") +
  theme_minimal()
print(charges_children_boxplot)
````
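
The outliers in each group can be counted rather than estimated from the dots. The sketch below uses base R's `boxplot.stats()`, which flags points lying more than 1.5 times the box length beyond the box, on made-up data (`toy` is an assumption for illustration). The counts may differ slightly from the dots ggplot2 draws, because the two compute the box edges differently.

```{r}
# Made-up toy data for illustration only
toy <- data.frame(
  children = rep(c(0, 1), each = 5),
  charges  = c(1, 2, 3, 4, 100, 1, 2, 3, 4, 5)
)

# Count the flagged points separately within each number of children
outlier_counts <- tapply(
  toy$charges, toy$children,
  function(x) length(grDevices::boxplot.stats(x)$out)
)
outlier_counts
```

For the dashboard data the inputs would be `insurance_data$charges` and `insurance_data$children`.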